Nature Machine Intelligence
Springer Science and Business Media LLC
All preprints, ranked by how well they match Nature Machine Intelligence's content profile, based on 61 papers previously published here. The average preprint has a 0.13% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Leonardsen, E. H.; Persson, K.; Grodem, E.; Dinsdale, N.; Schellhorn, T.; Roe, J. M.; Vidal-Pineiro, D.; Sorensen, O.; Kaufmann, T.; Marquand, A.; Selbaek, G.; Andreassen, O. A.; Wolfers, T.; Westlye, L. T.; Wang, Y.
Deep learning approaches for clinical predictions based on magnetic resonance imaging data have shown great promise as a translational technology for diagnosis and prognosis in neurological disorders, but their clinical impact has been limited. This is partially attributed to the opaqueness of deep learning models, which leaves insufficient understanding of what underlies their decisions. To overcome this, we trained convolutional neural networks on structural brain scans to differentiate dementia patients from healthy controls, and applied layerwise relevance propagation to procure individual-level explanations of the model predictions. Through extensive validations we demonstrate that deviations recognized by the model corroborate existing knowledge of structural brain aberrations in dementia. By employing the explainable dementia classifier in a longitudinal dataset of patients with mild cognitive impairment, we show that the spatially rich explanations complement the model prediction when forecasting transition to dementia and help characterize the biological manifestation of disease in the individual brain. Overall, our work exemplifies the clinical potential of explainable artificial intelligence in precision medicine.
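Layerwise relevance propagation, as used in this abstract, redistributes a model's output score backward through the network layer by layer. A minimal sketch of the epsilon rule for a single linear layer, with toy weights (this is an illustration of the general technique, not the authors' implementation):

```python
import numpy as np

def lrp_epsilon(x, W, b, relevance_out, eps=1e-6):
    """Epsilon-rule LRP for a single linear layer z = W @ x + b.

    Output relevance is redistributed onto the inputs in proportion
    to each input's contribution z_ij = W_ij * x_j.
    """
    z = W @ x + b                                # forward pre-activations
    s = relevance_out / (z + eps * np.sign(z))   # stabilised relevance ratio
    return x * (W.T @ s)                         # relevance per input feature

# toy example: two input features, one output neuron
x = np.array([1.0, 3.0])
W = np.array([[0.5, 0.5]])
b = np.array([0.0])
R_in = lrp_epsilon(x, W, b, relevance_out=np.array([2.0]))
# with b = 0, the input relevances sum (approximately) to the output relevance
```

Applied to a convolutional classifier, the same backward pass yields a voxel-level relevance map, which is what makes the explanations spatially rich.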
Fisher, G. R.
Vision-language models (VLMs) pretrained on web-scale data have achieved remarkable performance across diverse tasks, leading to widespread adoption in industry. A natural question is whether these powerful representations transfer to specialized medical imaging domains, and whether domain-specific medical pretraining improves transfer. We tested these hypotheses using two VLMs on the NIH ChestX-ray14 benchmark: Qwen2.5-VL (pretrained on web data) and BiomedCLIP (pretrained on 15 million PubMed biomedical image-text pairs). Both models dramatically underperformed compared to convolutional neural networks (CNNs) with ImageNet pretraining. Across 5 random seeds, the best VLM achieved F1=0.196 ± 0.004 versus a CNN baseline of F1=0.811. Domain-specific pretraining provided marginal improvement: BiomedCLIP's frozen encoder achieved F1=0.161 ± 0.001 versus Qwen's F1=0.124 (+30%), but this remains clinically inadequate. Fine-tuning both models led to catastrophic overfitting, with sensitivity collapsing from >65% to <36% as the models learned to predict "no disease" for all inputs. These results demonstrate that neither general-purpose nor medical-specific vision-language pretraining produces features suitable for dense multi-label medical imaging classification. For chest X-ray diagnosis, traditional CNNs with ImageNet pretraining remain substantially more effective than VLM-based approaches.
Otani, Y.; Koga, D.; Wakizaka, Y.; Shimizu, H.
Drug-resistant infections pose a global health challenge and necessitate the rapid development of novel antibiotics. Although high-speed and high-accuracy in silico drug discovery methods using AI have been established, only a few approaches that specifically target antibiotic development have been developed. This gap significantly limits our ability to rapidly discover effective antibacterials against emerging resistant pathogens. Here, we have developed BaCNet, an AI system that accurately predicts the binding affinity between bacterial proteins and compounds using only amino acid sequences and compound SMILES representations. Our approach integrates a protein language model with three complementary compound embedding methods, achieving high prediction accuracy and effectively maintaining performance when tested on previously unseen bacterial species. BaCNet successfully rediscovered known antibiotics and identified promising novel candidates, with molecular dynamics simulations confirming stable binding of top hits. Moreover, by integrating a compound generation and optimization system with BaCNet, we discovered novel compounds not present in existing databases with significantly enhanced predicted antibacterial activity. BaCNet represents a promising platform that could accelerate the identification of urgently needed treatments against resistant pathogens.
Shi, M.; Zheng, H.; Gottumukkala, R.; Jonathan, N.; Armstong, G. W.; Shen, L. Q.; Wang, M.
Early screening for glaucoma and diabetic retinopathy (DR) is critical to prevent irreversible vision loss, yet remains inaccessible to many underserved populations. However, AI models trained on hospital-grade fundus images often generalize poorly to low-cost images acquired with portable devices such as smartphones. We propose CausalFund, a causality-inspired learning framework for training AI models that enable reliable low-resource screening from easily acquired non-clinical images. CausalFund disentangles disease-relevant retinal features from spurious image factors to achieve domain-generalizable screening across clinical and non-clinical settings. We integrated CausalFund with seven deep learning backbones for glaucoma and DR screening from portable-device fundus images, including lightweight architectures suitable for on-device deployment. Across diverse experimental settings and image quality conditions, CausalFund consistently improved AUC and achieved a more favorable sensitivity-specificity trade-off than conventional deep learning baselines. As a model-agnostic framework, CausalFund could be extended to other diseases and low-resource scenarios characterized by degraded or non-standard imaging.
Fisher, G. R.
We achieved state-of-the-art performance on the NIH ChestX-ray14 multi-label classification task using a simple 3-model ensemble: mean ROC-AUC 0.940, F1 0.821 (95% CI: 0.799-0.845), PR-AUC 0.827, sensitivity 76.0%, and specificity 98.8% across 14 thoracic diseases. Our primary finding challenges current research priorities: pretraining diversity dominates architectural diversity. Systematic evaluation of 255 ensemble combinations from 8 models spanning three architecture families (ConvNeXt, Vision Transformers, EfficientNet) at multiple resolutions (224x224 to 384x384) revealed that a simple 3-model ConvNeXt ensemble combining ImageNet-1K, ImageNet-21K, and ImageNet-21K-384 pretrained variants outperformed all 252 alternative combinations, including modern Vision Transformers and efficiency-optimized architectures. This ensemble achieved mean ROC-AUC 0.940, exceeding recent hybrid transformer approaches (LongMaxViT [1]: 0.932) with substantially lower computational requirements. Systematic comparison of five optimization strategies (F1, F_SS, pure sensitivity, Youden's J, validation loss) established that clinical metric optimization outperforms traditional validation loss by 19.5% in F1 score. F_SS optimization (sensitivity-specificity harmonic mean) achieved optimal clinical balance: highest sensitivity (73.9%), best Youden's J (0.727), and superior threshold-independent performance (ROC-AUC, PR-AUC). Traditional validation loss optimization failed to align with diagnostic utility despite achieving mathematical convergence. Strategic pretraining selection and clinical metric optimization provide greater performance improvements than architectural innovation alone, enabling competitive state-of-the-art results on accessible computational resources (AWS g5.2xlarge, $1.21/hr).
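The F_SS metric named above is the harmonic mean of sensitivity and specificity, and Youden's J is their sum minus one. A small sketch of how a decision threshold could be scored these ways (the confusion-matrix counts below are illustrative, not the paper's data):

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of diseased cases detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: fraction of healthy cases correctly cleared."""
    return tn / (tn + fp)

def f_ss(sens, spec):
    """Harmonic mean of sensitivity and specificity (the F_SS criterion)."""
    return 2 * sens * spec / (sens + spec)

def youdens_j(sens, spec):
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sens + spec - 1

# illustrative counts for one disease label at one threshold
sens = sensitivity(tp=739, fn=261)   # 73.9% sensitivity
spec = specificity(tn=988, fp=12)    # 98.8% specificity
score = f_ss(sens, spec)
```

Because the harmonic mean punishes whichever of the two rates is lower, optimizing F_SS discourages the degenerate "predict no disease" solution that plain validation-loss optimization can fall into.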
Datta, S. K.; Shaikh, M. A.; Srihari, S. N.; Gao, M.
In clinical applications, neural networks must focus on and highlight the most important parts of an input image. The Soft-Attention mechanism enables a neural network to achieve this goal. This paper investigates the effectiveness of Soft-Attention in deep neural architectures. The central aim of Soft-Attention is to boost the value of important features and suppress the noise-inducing features. We compare the performance of VGG, ResNet, Inception ResNet v2 and DenseNet architectures with and without the Soft-Attention mechanism, while classifying skin lesions. When coupled with Soft-Attention, the original network outperforms the baseline [16] by 4.7% while achieving a precision of 93.7% on the HAM10000 dataset [25]. Additionally, Soft-Attention coupling improves the sensitivity score by 3.8% compared to the baseline [31] and achieves 91.6% on the ISIC-2017 dataset [2]. The code is publicly available at github1.
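The core idea of a soft-attention module is a softmax map over spatial locations that rescales the feature map, amplifying salient regions and damping noisy ones. A numpy sketch of that rescaling step (shapes, the projection `w`, and the residual form are all illustrative assumptions, not the paper's exact module):

```python
import numpy as np

def soft_attention(features, w, gamma=1.0):
    """Soft attention over an (H, W, C) feature map.

    A scalar score per spatial location is computed with a learned
    projection w, turned into a softmax attention map, and used to
    rescale the features (boosting salient regions, suppressing noise).
    """
    H, Wd, C = features.shape
    scores = features.reshape(H * Wd, C) @ w        # one score per location
    attn = np.exp(scores - scores.max())            # stabilised exponentials
    attn = attn / attn.sum()                        # softmax attention map
    attn = attn.reshape(H, Wd, 1)                   # broadcast over channels
    return features + gamma * attn * features       # residual rescaling

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 4, 8))              # toy conv feature map
w = rng.standard_normal(8)                          # toy learned projection
out = soft_attention(feats, w)
```

In a full network this sits between convolutional blocks, so the classifier downstream sees attention-reweighted features rather than the raw map.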
Mascart, C.; Tran, K.; Samoilova, K.; Storan, L. T.; Liu, T.; Koulakov, A.
Recent advances in deep learning have enabled prediction of odorant perception from molecular structure, opening new avenues for odor classification. However, most existing models are limited to predicting percepts from fixed vocabularies and fail to capture the full richness of olfactory experience. Progress is further limited by the scarcity of large-scale olfactory datasets and the lack of standardized metrics for evaluating free-form natural-language odor descriptions. To address these challenges, we introduce Odor Description and Inference Evaluation Understudy (ODIEU), a benchmark that includes perceptual descriptions of over 10,000 molecules paired with a model-based metric for evaluating free-form odor text descriptions. The model-based metric uses Sentence-BERT (SBERT) models which are fine-tuned on olfactory descriptions to allow better evaluation of human-generated odor texts. Using the fine-tuned SBERT models, we show that free-form text odor descriptions contain additional perceptual information in their syntactic structure compared to semantic labels. We further introduce CIRANO (Chemical Information Recognition and Annotation Network for Odors), a transformer-based model that generates free-form odor descriptions directly from molecular structure, thus implementing molecular structure-to-text (S2T) prediction. CIRANO achieves performance comparable to humans. Finally, we generate human-like descriptions from mouse olfactory bulb neural data using an invertible SBERT model, yielding neural-to-text (N2T) predictions highly aligned with human descriptions. Together, CIRANO and ODIEU establish a standardized framework for generating natural language olfactory descriptions and evaluating their alignment with human perception. Code is available at https://github.com/KoulakovLab/ODIEU
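An embedding-based text metric of this kind typically encodes the candidate and reference descriptions and scores their cosine similarity. A hedged sketch of the scoring step, with toy vectors standing in for SBERT embeddings (the max-over-references pooling is one common choice, not necessarily ODIEU's):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def description_score(candidate_vec, reference_vecs):
    """Score a generated odor description against several human references
    by taking the best cosine similarity (toy stand-in for an SBERT metric)."""
    return max(cosine_similarity(candidate_vec, r) for r in reference_vecs)

# toy 3-d "embeddings": candidate is close to the first reference
cand = [1.0, 0.0, 1.0]
refs = [[1.0, 0.0, 0.9], [0.0, 1.0, 0.0]]
score = description_score(cand, refs)
```

Fine-tuning the encoder on olfactory text, as the abstract describes, is what makes such similarities track perceptual closeness rather than generic semantic overlap.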
Sun, Y.; Chen, K.; Liu, K.; Ye, Q.
Self-supervised learning on 3D molecular structures is gaining importance in data-driven scientific research and applications due to the high costs of annotating bio-chemical data. However, the strategic selection of semantic units for modeling 3D molecular structures remains underexplored, despite its crucial role in effective pre-training, a concept well established in language processing and computer vision. We introduce Localized Geometric Generation (LEGO), a novel approach that treats tetrahedrons within 3D molecular structures as fundamental building blocks, leveraging their geometric simplicity and widespread presence across chemical functional patterns. Inspired by masked modeling, LEGO perturbs tetrahedral local structures and learns to reconstruct them in a self-supervised manner. Experimental results demonstrate that LEGO consistently enhances molecular representations across biochemistry and quantum property prediction benchmarks. Additionally, the tetrahedral modeling and pretraining generalize from small molecules to larger molecular systems, as validated by protein-ligand affinity prediction. Our results highlight the potential of selecting semantic units to build more expressive and interpretable neural networks for scientific AI applications.
Shah, R.; Moradi, M.; Eslami, S.; Fujita, A.; Aziz, K.; Bineshfar, N.; Elze, T.; Eslami, M.; Kazeminasab, S.; Liebman, D.; Rasouli, S.; Vu, D.; Wang, M.; Yohannan, J.; Zebardast, N.
Glaucoma is a leading cause of irreversible blindness worldwide, with early intervention often being crucial. Research into the underpinnings of glaucoma often relies on electronic health records (EHRs) to identify patients with glaucoma and their subtypes. However, current methods for identifying glaucoma patients from EHRs are often inaccurate or infeasible at scale, relying on International Classification of Diseases (ICD) codes or manual chart reviews. To address this limitation, we introduce (1) OphthaBERT, a powerful general clinical ophthalmology language model trained on over 2 million diverse clinical notes, and (2) a fine-tuned variant of OphthaBERT that automatically extracts binary and subtype glaucoma diagnoses from clinical notes. The base OphthaBERT model is a robust encoder, outperforming state-of-the-art clinical encoders in masked token prediction on out-of-distribution ophthalmology clinical notes and binary glaucoma classification with limited data. We report significant binary classification performance improvements in low-data regimes (p < 0.001, Bonferroni corrected). OphthaBERT is also able to achieve superior classification performance for both binary and subtype diagnosis, outperforming even fine-tuned large decoder-only language models at a fraction of the computational cost. We demonstrate a 0.23-point increase in macro-F1 for subtype diagnosis over ICD codes and strong binary classification performance when externally validated at Wilmer Eye Institute. OphthaBERT provides an interpretable, equitable framework for general ophthalmology language modeling and automated glaucoma diagnosis.
Stoffl, L.; Bonnetto, A.; d'Ascoli, S.; Mathis, A.
Natural behavior is hierarchical. Yet, there is a paucity of benchmarks addressing this aspect. Recognizing the scarcity of large-scale hierarchical behavioral benchmarks, we create a novel synthetic basketball playing benchmark (Shot7M2). Beyond synthetic data, we extend BABEL into a hierarchical action segmentation benchmark (hBABEL). Then, we develop a masked autoencoder framework (hBehaveMAE) to elucidate the hierarchical nature of motion capture data in an unsupervised fashion. We find that hBehaveMAE learns interpretable latents on Shot7M2 and hBABEL, where lower encoder levels show a superior ability to represent fine-grained movements, while higher encoder levels capture complex actions and activities. Additionally, we evaluate hBehaveMAE on MABe22, a representation learning benchmark with short and long-term behavioral states. hBehaveMAE achieves state-of-the-art performance without domain-specific feature extraction. Together, these components synergistically contribute towards unveiling the hierarchical organization of natural behavior. Models and benchmarks are available at https://github.com/amathislab/BehaveMAE.
Hao, M.; Gong, J.; Zeng, X.; Liu, C.; Guo, Y.; Cheng, X.; Wang, T.; Ma, J.; Song, L.; Zhang, X.
Large-scale pretrained models have become foundation models leading to breakthroughs in natural language processing and related fields. Developing foundation models in life science for deciphering the "languages" of cells and facilitating biomedical research is promising yet challenging. We developed a large-scale pretrained model scFoundation with 100M parameters for this purpose. scFoundation was trained on over 50 million human single-cell transcriptomic profiles, which contain high-throughput observations of the complex molecular features in all known types of cells. scFoundation is currently the largest model in terms of the size of trainable parameters, dimensionality of genes and the number of cells used in the pre-training. Experiments showed that scFoundation can serve as a foundation model for single-cell transcriptomics and achieve state-of-the-art performance in a diverse array of downstream tasks, such as gene expression enhancement, tissue drug response prediction, single-cell drug response classification, and single-cell perturbation prediction.
Chen, Y.; Bian, H.; Wei, L.; Jia, J.; Dong, X.; Li, Y.; Zhao, Y.; Wu, X.; Li, C.; Luo, E.; Xiao, C.; Hao, M.; Zhang, X.
Cells can be viewed as complex stories written by coordinated expression of genes. The success of AI large language models (LLMs) in mastering human language inspired us to develop a large AI model scMulan with 368 million parameters to generate cell transcriptomics with designated attributes by learning the cell language. We defined a unified c-sentence to incorporate cell transcriptomics and meta-attributes, and pre-trained scMulan on the equivalent of 100 million human cells. Experiments showed that scMulan can generate designated pseudo transcriptomics, predict missing attributes of cells, reconstruct unobserved cells along functional gradients, and can help to identify driving regulators of cell fates. The generated data passed tests of current tools and can reflect the underlying biology.
Cai, T.; Abbu, K. A.; Liu, Y.; Xie, L.
Drug discovery has witnessed intensive exploration of the problem of drug-target physical interactions for over two decades; however, a strong drug binding affinity to a single target often fails to translate into desired clinical outcomes. A critical knowledge gap needs to be filled to correlate drug-target interactions with phenotypic responses: predicting receptor activity or functional selectivity upon ligand binding (i.e., agonist vs. antagonist) on a genome scale and for novel chemicals. Two major obstacles compound the difficulty in this direction: known receptor-activity data are far too scarce to train a robust model for genome-scale applications, and real-world applications need to deploy a model on data from various shifted distributions. To address these challenges, we have developed an end-to-end deep learning framework, DeepREAL, for multi-scale modeling of genome-wide receptor activities of ligand binding. DeepREAL utilizes self-supervised learning on tens of millions of protein sequences and pre-trained binary interaction classification to solve the data distribution shift and data scarcity problems. Extensive benchmark studies that simulate real-world scenarios demonstrate that DeepREAL achieves state-of-the-art performance in out-of-distribution settings.
Kwee, B. P. Y.; Messemaker, M.; Marcus, E.; Oliveira, G.; Scheper, W.; Wu, C.; Teuwen, J.; Schumacher, T.
The prediction of peptide-MHC (pMHC) recognition by αβ T-cell receptors (TCRs) remains a major biomedical challenge. Here, we develop STAPLER (Shared TCR And Peptide Language bidirectional Encoder Representations from transformers), a transformer language model that uses a joint TCRβ-peptide input to allow the learning of patterns within and between TCRβ and peptide sequences that encode recognition. First, we demonstrate how data leakage during negative data generation can confound performance estimates of neural network-based models in predicting TCR-pMHC specificity. We then demonstrate that, because of its pre-training and fine-tuning masked language modeling tasks, STAPLER outperforms both neural network-based and distance-based ML models in predicting the recognition of known antigens in an independent dataset, in particular for antigens for which little related data is available. Based on this ability to efficiently learn from limited labeled TCR-peptide data, STAPLER is well suited to utilize growing TCR-pMHC datasets to achieve accurate prediction of TCR-pMHC specificity.
Wei, Z.; Li, M. L.
We introduce the History-Guided Deep Compartmental Model (HG-DCM), a novel framework for early-stage pandemic forecasting that synergizes deep learning with compartmental modeling to harness the strengths of both interpretability and predictive capacity. HG-DCM employs a Residual Convolutional Neural Network (RCNN) to learn temporal patterns from historical and current pandemic data while incorporating epidemiological and demographic metadata to infer interpretable parameters for a compartmental model to forecast future pandemic growth. Experimental results on early-stage COVID-19 and Monkeypox forecasting tasks demonstrate that HG-DCM outperforms both standard compartmental models (e.g., DELPHI) and standalone deep neural networks (e.g., GRU) in predictive accuracy and stability, particularly with limited data. By effectively integrating historical pandemic insights, HG-DCM offers a scalable approach for interpretable and accurate forecasting, laying the groundwork for future real-time pandemic modeling applications.
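In a hybrid of this kind, the network's job reduces to producing interpretable epidemiological parameters (e.g., transmission and recovery rates) that a classical compartmental model then integrates forward. A minimal discrete-time SIR step as a generic stand-in for the DELPHI-style compartmental component (the parameter values below are illustrative, not learned):

```python
def sir_step(S, I, R, beta, gamma, N, dt=1.0):
    """One forward-Euler step of the SIR compartmental model.

    beta  - transmission rate (in HG-DCM-like hybrids, inferred by the network)
    gamma - recovery rate
    N     - total population (S + I + R is conserved)
    """
    new_infections = beta * S * I / N * dt
    new_recoveries = gamma * I * dt
    return (S - new_infections,
            I + new_infections - new_recoveries,
            R + new_recoveries)

# forecast a short horizon with fixed illustrative parameters
S, I, R = 990.0, 10.0, 0.0
for _ in range(30):
    S, I, R = sir_step(S, I, R, beta=0.3, gamma=0.1, N=1000.0)
```

Because the forecast is generated by the mechanistic model, the learned parameters stay directly interpretable, which is the interpretability-plus-capacity trade the abstract describes.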
Kawakami, T.; Hosokawa, S.; Masamichi, I.; Kurozumi, A.; Tanaka, R.; Minatsuki, S.; Ishida, J.; Isagawa, T.; Kodera, S.; Takeda, N.
Single-cell RNA sequencing (scRNA-seq) of patient samples holds promise for understanding disease mechanisms, but faces the challenge of excessive cost and effort in acquisition, processing, and data analysis, making it essential to leverage existing data. Pulmonary artery hypertension (PAH) is a refractory disease characterized by pulmonary vascular remodeling, and access to patient specimens is limited due to difficulties in tissue collection. In this study, we employed transfer learning with Geneformer, a deep learning algorithm pre-trained with scRNA-seq datasets, and fine-tuned it with public PAH lung tissue data to identify disease-relevant genes. The resulting algorithm, which we named PAH-former, demonstrated that its prediction accuracy varied significantly depending on the dataset used for fine-tuning. PAH-former enabled us to perform in silico perturbation analysis and identified PAH-related genes. Loss of function of PAH-related genes in human pulmonary artery endothelial cells increased the expression of SOX18, a signature gene of PAH. This integration of artificial intelligence and biological experiments can significantly advance our understanding of the molecular mechanisms of PAH.
Wang, W.; Qi, C.; Wei, Z.
Accurately modeling the binding between T-cell receptors (TCRs) and peptide-MHC (pMHC) complexes is essential for guiding immunotherapy development and personalized vaccine design. However, the vast diversity of TCR repertoires and the scarcity of experimentally validated interactions make generalization to unseen epitopes challenging. This paper proposes TIDE, a cross-attention-driven dual-encoder framework that leverages large protein and molecular language models to learn discriminative representations of TCRs and peptides. In TIDE, TCR sequences are encoded using Evolutionary Scale Modeling (ESM), while peptides are transformed into SMILES strings and processed by MolFormer to capture chemical and structural properties. Multi-layer cross-attention then refines and integrates these embeddings, highlighting interaction-relevant patterns without requiring explicit structural alignment. Evaluated on the TCHard benchmark under both zero-shot and few-shot settings, TIDE achieves superior predictive accuracy and robustness compared to state-of-the-art baselines such as ChemBERTa, TITAN, and NetTCR. These results demonstrate that combining pretrained language models with cross-attention fusion offers a powerful approach for TCR-pMHC binding prediction and paves the way for more reliable computational immunology applications.
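Cross-attention of the kind described lets each TCR token attend over the peptide embedding (and vice versa) so that interaction-relevant positions are weighted up without structural alignment. A single-head scaled dot-product cross-attention in numpy (shapes and names are illustrative, not TIDE's actual code):

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries      : (Lq, d) e.g. TCR token embeddings from a protein LM
    keys, values : (Lk, d) e.g. peptide token embeddings
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (Lq, Lk) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over peptide tokens
    return weights @ values                          # (Lq, d) fused representation

rng = np.random.default_rng(1)
tcr = rng.standard_normal((12, 16))   # 12 TCR tokens, 16-d toy embeddings
pep = rng.standard_normal((9, 16))    # 9 peptide tokens
fused = cross_attention(tcr, pep, pep)
```

Stacking several such layers in both directions, as the abstract describes, lets the two encoders exchange information before the final binding prediction head.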
Zhao, Y.; Zhao, B.; Zhang, F.; He, C.; Wu, W.; Lai, L.
The rapid advancement of single-cell sequencing technology has significantly deepened our understanding of cellular heterogeneity, yet it concurrently presents substantial challenges for the unified modeling of single-cell data. Simultaneously, pre-trained foundation models have achieved notable success in domains such as natural language processing and image analysis. However, extending these models to accommodate ultra-long single-cell transcriptome sequences, characterized by an extensive number of genes, remains a formidable task. In this study, we introduce SC-MAMBA2, based on the MAMBA2 architecture, meticulously designed with a bidirectional modeling approach tailored for single-cell transcriptomics data. As the first single-cell foundation model to integrate the state-space models (SSMs) underlying the MAMBA2 architecture, SC-MAMBA2 features over 625 million parameters, covers more than 60,000 genes, and was pre-trained on a dataset of over 57 million cells, making it the most comprehensive solution for processing ultra-long transcriptome sequences. Extensive benchmarking across a diverse array of downstream tasks consistently demonstrates that SC-MAMBA2 surpasses state-of-the-art models, delivering superior accuracy and enhanced computational efficiency.
Wu, C.; Restrepo, D.; Nakayama, L. F.; Ribeiro, L. Z.; Shuai, Z.; Barboza, N. S.; Vieira Sousa, M. L.; Fitterman, R. D.; Alves Pereira, A. D.; Saito Regatieri, C. V.; Stuchi, J. A.; Malerbi, F. K.; Andrade, R. E.
This paper introduces mBRSET, the first publicly available retina dataset captured using handheld retinal cameras in real-life, high-burden scenarios, comprising 5,164 images from 1,291 patients of diverse backgrounds. This dataset addresses the lack of ophthalmological data in low- and middle-income countries (LMICs) by providing a cost-effective and accessible solution for ocular screening and management. Portable retinal cameras enable applications outside traditional hospital settings, such as community health screenings and telemedicine consultations, thereby democratizing healthcare. Extensive metadata that are typically unavailable in other datasets, including age, sex, diabetes duration, treatments, and comorbidities, are also recorded. To validate the utility of mBRSET, state-of-the-art deep models, including ConvNeXt V2, Dino V2, and SwinV2, were trained for benchmarking, achieving high accuracy in clinical tasks (diagnosing diabetic retinopathy and macular edema) and in fairness tasks (predicting education and insurance status). The mBRSET dataset serves as a resource for developing AI algorithms and investigating real-world applications, enhancing ophthalmological care in resource-constrained environments.
Zhong, Y.; Yan, W.; Zhang, Y.; Tan, K.; Bian, B.
mRNA serves as a crucial bridge between DNA and proteins. Compared to DNA, mRNA sequences are much more concise and information-dense, which makes mRNA an ideal language through which to explore various biological principles. In this study, we present NUWA, a large mRNA language foundation model leveraging a BERT-like architecture, trained with curriculum masked language modeling and supervised contrastive loss for unified mRNA sequence perception and generation. For pretraining, we utilized large-scale mRNA coding sequences comprising approximately 80 million sequences from 19,676 bacterial species, 33 million from 4,688 eukaryotic species, and 2.1 million from 702 archaeal species, and pre-trained three domain-specific models respectively. This enables NUWA to learn coding sequence patterns across the entire tree of life. The fine-tuned NUWA demonstrates strong performance across a variety of downstream tasks, excelling not only in RNA-related perception tasks but also exhibiting robust capability in cross-modal protein-related tasks. On the generation front, NUWA pioneers an entropy-guided strategy that enables BERT-like models to generate mRNA sequences, producing natural-like sequences that accurately recapitulate species-specific codon usage patterns. Moreover, NUWA can be effectively fine-tuned on small, task-specific datasets to generate functional mRNAs with desired properties, including sequences that do not exist in nature, and to design coding sequences for diverse proteins in biomanufacturing, vaccine development, and therapeutic applications. To our knowledge, NUWA represents the first mRNA language model for unified sequence perception and generation, providing a versatile and programmable platform for mRNA design.
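An entropy-guided decoding strategy of the kind described fills masked positions in order of model confidence: at each step, commit the position whose predicted token distribution has the lowest entropy. A toy sketch with hand-made probability tables standing in for the model (in a real masked LM the predictor would be re-run conditioned on each newly committed token; the static table here only illustrates the ordering rule):

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_guided_fill(masked_positions, predict):
    """Iteratively commit the masked position the model is most certain about.

    predict(pos) -> dict mapping candidate tokens to probabilities.
    """
    filled = {}
    remaining = set(masked_positions)
    while remaining:
        # lowest-entropy position = highest-confidence prediction
        pos = min(remaining, key=lambda p: entropy(predict(p).values()))
        dist = predict(pos)
        filled[pos] = max(dist, key=dist.get)   # commit the argmax token
        remaining.remove(pos)
    return filled

# toy predictor: position 1 is near-certain, position 0 is uncertain
TOY = {0: {"A": 0.4, "C": 0.3, "G": 0.3}, 1: {"A": 0.95, "C": 0.05}}
result = entropy_guided_fill([0, 1], lambda p: TOY[p])
```

Committing high-confidence positions first lets later, harder positions condition on already-resolved context, which is what makes iterative generation workable for an encoder-only model.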